# Logging

As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging.
By default, the TRL [`PPOTrainer`] saves a lot of relevant information to `wandb` or `tensorboard`.

Upon initialization, pass one of these two options to the [`PPOConfig`]:
```python
config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)
```
If you want to log with tensorboard, add the kwarg `project_kwargs={"logging_dir": PATH_TO_LOGS}` to the [`PPOConfig`].
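
For example, a minimal sketch of a tensorboard setup (the model name and logging directory below are placeholders, not TRL defaults):

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",  # placeholder model name
    log_with="tensorboard",
    # tensorboard event files will be written under this directory
    project_kwargs={"logging_dir": "./ppo_logs"},
)
```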

## PPO Logging

### Crucial values
During training, many values are logged. Here are the most important ones (the sketch after this list shows where they come from):

1. `env/reward_mean`, `env/reward_std`, `env/reward_dist`: The properties of the reward distribution from the "environment".
2. `ppo/mean_scores`: The mean scores directly out of the reward model.
3. `ppo/mean_non_score_reward`: The mean negated KL penalty during training. It indicates how far the new policy has diverged from the reference model over the batch in each step.
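
These statistics reach `wandb` or `tensorboard` when you call [`PPOTrainer.log_stats`] at the end of each step. Below is a minimal, illustrative loop showing where the values come from; the model name, prompt, and constant reward are placeholder assumptions, and in practice the rewards would come from your reward model:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Tiny illustrative setup; batch_size=1 keeps the sketch short.
config = PPOConfig(model_name="gpt2", log_with="wandb", batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, tokenizer=tokenizer)

query = tokenizer.encode("Hello there", return_tensors="pt").squeeze(0)
# return_prompt=False keeps only the generated tokens, as step() expects.
response = ppo_trainer.generate(
    query, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
).squeeze(0)
reward = torch.tensor(1.0)  # placeholder; normally a reward model score

# One optimization step; `stats` contains the values listed above.
stats = ppo_trainer.step([query], [response], [reward])

# Sends env/reward_*, ppo/mean_scores, etc. to wandb/tensorboard.
batch = {"query": [tokenizer.decode(query)], "response": [tokenizer.decode(response)]}
ppo_trainer.log_stats(stats, batch, [reward])
```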

### Training stability parameters
Here are some values that are useful to monitor for stability. When these diverge or collapse to 0, try tuning your hyperparameters (a small monitoring sketch follows the list):

1. `ppo/loss/value`: The value function loss; it will spike or go to NaN when training is not going well.
2. `ppo/val/clipfrac`: The fraction of clipped values in the value function loss. This typically falls between 0.3 and 0.6.
3. `objective/kl_coef`: The KL penalty coefficient, adapted toward the target KL by [`AdaptiveKLController`]. It often increases shortly before numerical instabilities appear.
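
Because [`PPOTrainer.step`] returns this stats dictionary, you can also watch these values directly in your training loop, continuing from the sketch above. The threshold below is an arbitrary illustration, not a TRL recommendation:

```python
import math

# `stats` is the dictionary returned by ppo_trainer.step(...).
value_loss = float(stats["ppo/loss/value"])
clipfrac = float(stats["ppo/val/clipfrac"])
kl_coef = float(stats["objective/kl_coef"])

# Arbitrary example threshold: flag a spiking or NaN value loss.
if math.isnan(value_loss) or value_loss > 100.0:
    print(f"warning: value loss unstable ({value_loss})")
print(f"val/clipfrac={clipfrac:.3f} kl_coef={kl_coef:.4f}")
```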